Part III: Machine Learning¶
1. Preprocessing¶
Import packages and load data¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm # colormaps
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
business_file = '../project_data/yelp_business.csv'
review_file = '../project_data/yelp_review.csv'
# business_file = '../project_data/yelp_business.csv'
# review_file = 'yelp_review.csv'
# Load the business data and review data
business_df = pd.read_csv(business_file)
review_df = pd.read_csv(review_file)
Handle Missing Values¶
Dropping Missing Values in Data:
- Rows in `business_df` with missing values in the `categories` column were dropped.
- Rows in `review_df` with missing values in the `text` column were dropped.
- This step ensures that all businesses have valid category information, which is essential for filtering later.
business_df = business_df.dropna(subset=['categories'])
review_df = review_df.dropna(subset=['text'])
review_df = review_df.sample(frac=0.01, random_state=42)
print(business_df.isnull().sum())
print(review_df.isnull().sum())
print('After dropping NA: business data :', business_df.shape)
print('After dropping NA: review data :', review_df.shape)
business_id 0 name 0 address 5126 city 0 state 0 postal_code 73 latitude 0 longitude 0 stars 0 review_count 0 is_open 0 attributes 13642 categories 0 hours 23120 dtype: int64 review_id 0 user_id 0 business_id 0 stars 0 useful 0 funny 0 cool 0 text 0 date 0 dtype: int64 After dropping NA: business data : (150243, 14) After dropping NA: review data : (69903, 9)
Filtering for Restaurants:
- Filtered businesses with "Restaurants" in their `categories` column into `restaurant_df`.
- Retained reviews in `review_df` whose `business_id` matched `restaurant_df`.
- This ensures the dataset focuses only on restaurants.
# Filter out businesses that are not restaurants
restaurant_df = business_df[business_df['categories'].str.contains('Restaurants')].copy()
restaurant_ids = restaurant_df['business_id'].values
# Filter out reviews that are not for restaurants
restaurant_review_df = review_df[review_df['business_id'].isin(restaurant_ids)].copy()
print('Number of reviews for restaurants:', len(restaurant_review_df))
print(review_df.head())
print(business_df.head())
Number of reviews for restaurants: 47195
review_id user_id \
1295256 J5Q1gH4ACCj6CtQG7Yom7g 56gL9KEJNHiSDUoyjk2o3Q
3297618 HlXP79ecTquSVXmjM10QxQ bAt9OUFX9ZRgGLCXG22UmA
1217795 JBBULrjyGx6vHto2osk_CQ NRHPcLq2vGWqgqwVugSgnQ
3730348 U9-43s8YUl6GWBFCpxUGEw PAxc0qpqt5c2kA0rjDFFAg
1826590 8T8EGa_4Cj12M6w8vRgUsQ BqPR1Dp5Rb_QYs9_fz9RiA
business_id stars useful funny cool \
1295256 8yR12PNSMo6FBYx1u5KPlw 2.0 1 0 0
3297618 pBNucviUkNsiqhJv5IFpjg 5.0 0 0 0
1217795 8sf9kv6O4GgEb0j1o22N1g 5.0 0 0 0
3730348 XwepyB7KjJ-XGJf0vKc6Vg 4.0 0 0 0
1826590 prm5wvpp0OHJBlrvTj9uOg 5.0 0 0 0
text \
1295256 Went for lunch and found that my burger was me...
3297618 I needed a new tires for my wife's car. They h...
1217795 Jim Woltman who works at Goleta Honda is 5 sta...
3730348 Been here a few times to get some shrimp. The...
1826590 This is one fantastic place to eat whether you...
date
1295256 2018-04-04 21:09:53
3297618 2020-05-24 12:22:14
1217795 2019-02-14 03:47:48
3730348 2013-04-27 01:55:49
1826590 2019-05-15 18:29:25
business_id name \
0 Pns2l4eNsfO8kk83dixA6A Abby Rappoport, LAC, CMQ
1 mpf3x-BjTdTEA3yCZrAYPw The UPS Store
2 tUFrWirKiKi_TAnsVWINQQ Target
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries
4 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery
address city state postal_code \
0 1616 Chapala St, Ste 2 Santa Barbara CA 93101
1 87 Grasso Plaza Shopping Center Affton MO 63123
2 5255 E Broadway Blvd Tucson AZ 85711
3 935 Race St Philadelphia PA 19107
4 101 Walnut St Green Lane PA 18054
latitude longitude stars review_count is_open \
0 34.426679 -119.711197 5.0 7 0
1 38.551126 -90.335695 3.0 15 1
2 32.223236 -110.880452 3.5 22 0
3 39.955505 -75.155564 4.0 80 1
4 40.338183 -75.471659 4.5 13 1
attributes \
0 {'ByAppointmentOnly': 'True'}
1 {'BusinessAcceptsCreditCards': 'True'}
2 {'BikeParking': 'True', 'BusinessAcceptsCredit...
3 {'RestaurantsDelivery': 'False', 'OutdoorSeati...
4 {'BusinessAcceptsCreditCards': 'True', 'Wheelc...
categories \
0 Doctors, Traditional Chinese Medicine, Naturop...
1 Shipping Centers, Local Services, Notaries, Ma...
2 Department Stores, Shopping, Fashion, Home & G...
3 Restaurants, Food, Bubble Tea, Coffee & Tea, B...
4 Brewpubs, Breweries, Food
hours
0 NaN
1 {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...
2 {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...
3 {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
4 {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2...
Feature Engineering for Reviews:
- Convert the date variable from string to a pandas datetime object
- Extract the year, month, day, and hour of the review date
- Extract the string length of the review
# Data engineering for review df
# Convert date to datetime format
restaurant_review_df['date'] = pd.to_datetime(
restaurant_review_df['date'], errors='coerce')
# Extract month and year from the date
restaurant_review_df['year'] = restaurant_review_df['date'].dt.year
restaurant_review_df['month'] = restaurant_review_df['date'].dt.month
restaurant_review_df['day'] = restaurant_review_df['date'].dt.day
restaurant_review_df['review_hr'] = restaurant_review_df['date'].dt.hour
restaurant_review_df['review_length'] = \
restaurant_review_df['text'].apply(len)
restaurant_review_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 47195 entries, 1295256 to 4428200 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 review_id 47195 non-null object 1 user_id 47195 non-null object 2 business_id 47195 non-null object 3 stars 47195 non-null float64 4 useful 47195 non-null int64 5 funny 47195 non-null int64 6 cool 47195 non-null int64 7 text 47195 non-null object 8 date 47195 non-null datetime64[ns] 9 year 47195 non-null int32 10 month 47195 non-null int32 11 day 47195 non-null int32 12 review_hr 47195 non-null int32 13 review_length 47195 non-null int64 dtypes: datetime64[ns](1), float64(1), int32(4), int64(4), object(4) memory usage: 4.7+ MB
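As a standalone illustration of the `pd.to_datetime` conversion and `.dt` accessors used above, on toy timestamps rather than the project data:

```python
import pandas as pd

# Toy frame mirroring the review-date extraction above
df = pd.DataFrame({'date': ['2018-04-04 21:09:53', '2020-05-24 12:22:14']})
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Component extraction via the .dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['review_hr'] = df['date'].dt.hour

print(df[['year', 'month', 'day', 'review_hr']].values.tolist())
# → [[2018, 4, 4, 21], [2020, 5, 24, 12]]
```

With `errors='coerce'`, any unparseable string becomes `NaT` instead of raising, which is why the conversion above is safe on messy review data.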
Feature Engineering for Businesses:
- Extract the average opening hours from daily opening hours
# Feature engineering for business df
import ast

def dict_to_avg_hrs(d):
    '''
    Convert the stringified hours dictionary to average hours per day
    '''
    def str_to_min(s):  # convert an 'H:M' time string to minutes
        h, m = s.split(':')
        return int(h) * 60 + int(m)
    if pd.isnull(d):
        return np.nan
    d = ast.literal_eval(d)  # safely parse the string into a dict
    hr_dict = {}
    for day, d_info in d.items():
        start, end = d_info.split('-')  # start and end time strings
        total_hrs = (str_to_min(end) - str_to_min(start)) / 60
        if total_hrs < 0:  # the closing time falls on the next day
            total_hrs += 24
        hr_dict[day] = total_hrs
    # average hours per day
    return sum(hr_dict.values()) / len(hr_dict)
# Acquire restaurants average opening hours
restaurant_df['avg_opening_hrs'] = \
restaurant_df['hours'].apply(dict_to_avg_hrs)
restaurant_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 52268 entries, 3 to 150340 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 business_id 52268 non-null object 1 name 52268 non-null object 2 address 51825 non-null object 3 city 52268 non-null object 4 state 52268 non-null object 5 postal_code 52247 non-null object 6 latitude 52268 non-null float64 7 longitude 52268 non-null float64 8 stars 52268 non-null float64 9 review_count 52268 non-null int64 10 is_open 52268 non-null int64 11 attributes 51703 non-null object 12 categories 52268 non-null object 13 hours 44990 non-null object 14 avg_opening_hrs 44990 non-null float64 dtypes: float64(4), int64(2), object(9) memory usage: 6.4+ MB
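To make the conversion concrete, here is a self-contained re-implementation of the same average-hours logic (the function name `avg_open_hours` is ours, not from the notebook), applied to a toy hours string including an overnight span:

```python
import ast
import numpy as np
import pandas as pd

def avg_open_hours(hours_str):
    # Standalone sketch of the dict_to_avg_hrs logic above
    if pd.isnull(hours_str):
        return np.nan
    hours = ast.literal_eval(hours_str)  # parse "{'Monday': '11:0-2:0', ...}"
    def to_min(t):
        h, m = t.split(':')
        return int(h) * 60 + int(m)
    daily = []
    for span in hours.values():
        start, end = span.split('-')
        hrs = (to_min(end) - to_min(start)) / 60
        if hrs < 0:  # closing after midnight wraps to the next day
            hrs += 24
        daily.append(hrs)
    return sum(daily) / len(daily)

# Monday 11:00-02:00 is 15 open hours; Tuesday 11:00-23:00 is 12
print(avg_open_hours("{'Monday': '11:0-2:0', 'Tuesday': '11:0-23:0'}"))  # → 13.5
```

The negative-duration branch is what keeps late-night closing times (like the `'11:0-2:0'` entries visible in the `hours` column above) from producing nonsense negative averages.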
Merging Reviews and Businesses:
- Merged
restaurant_review_dfandrestaurant_dfonbusiness_idusing an inner join. - Renamed
stars_xtoreview_starsandstars_ytobusiness_starsfor clarity. - Drop entries where average opening hours is null
merged_df = pd.merge(restaurant_review_df, restaurant_df, on='business_id', how='inner')
merged_df = merged_df.rename(columns={'stars_x': 'review_stars', 'stars_y': 'business_stars'})
#drop null avg_open_hrs to avoid errors in clustering/analysis
merged_df = merged_df[merged_df['avg_opening_hrs'].notnull()]
merged_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 45514 entries, 0 to 47193 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 review_id 45514 non-null object 1 user_id 45514 non-null object 2 business_id 45514 non-null object 3 review_stars 45514 non-null float64 4 useful 45514 non-null int64 5 funny 45514 non-null int64 6 cool 45514 non-null int64 7 text 45514 non-null object 8 date 45514 non-null datetime64[ns] 9 year 45514 non-null int32 10 month 45514 non-null int32 11 day 45514 non-null int32 12 review_hr 45514 non-null int32 13 review_length 45514 non-null int64 14 name 45514 non-null object 15 address 45460 non-null object 16 city 45514 non-null object 17 state 45514 non-null object 18 postal_code 45512 non-null object 19 latitude 45514 non-null float64 20 longitude 45514 non-null float64 21 business_stars 45514 non-null float64 22 review_count 45514 non-null int64 23 is_open 45514 non-null int64 24 attributes 45484 non-null object 25 categories 45514 non-null object 26 hours 45514 non-null object 27 avg_opening_hrs 45514 non-null float64 dtypes: datetime64[ns](1), float64(5), int32(4), int64(6), object(12) memory usage: 9.4+ MB
Identifying Major Cuisine Types:
- Extracted `major_cuisine` from `categories` based on predefined types (`Italian`, `Chinese`, `Mexican`, etc.).
- Retained only rows with an identified `major_cuisine` in the filtered dataset.
- Focuses the analysis on specific cuisines of interest.
# Define cuisine types of interest
cuisine_types = ["Italian", "Chinese", "Mexican", "Japanese", "American", "Indian"]
# Function to determine the major cuisine type
def get_major_cuisine(categories):
for cuisine in cuisine_types:
if cuisine.lower() in categories.lower():
return cuisine
return None # Return None if no major cuisine is found
# Apply the function to replace categories with the major cuisine type
merged_df['major_cuisine'] = merged_df['categories'].apply(get_major_cuisine)
# Filter to include only rows where a major cuisine was identified
filtered_df = merged_df[merged_df['major_cuisine'].notna()].copy()
# Display the updated DataFrame
print(filtered_df[['categories', 'major_cuisine']].head())
# Check the size of the filtered dataset
print(f"Number of entries with identified major cuisine types: {filtered_df.shape[0]}")
categories major_cuisine 3 American (New), Bars, Sports Bars, Restaurants... American 5 American (New), Restaurants American 7 Food, Sandwiches, American (Traditional), Rest... American 8 Nightlife, Bars, Mexican, Restaurants Mexican 9 Italian, Restaurants Italian Number of entries with identified major cuisine types: 28142
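Note that the matching is order-sensitive and substring-based: the first cuisine in `cuisine_types` that appears anywhere in the string wins. A standalone sketch (re-stating the function above on toy inputs, including one hypothetical over-match case not in the data shown):

```python
cuisine_types = ["Italian", "Chinese", "Mexican", "Japanese", "American", "Indian"]

def get_major_cuisine(categories):
    # First cuisine in list order found as a substring wins
    for cuisine in cuisine_types:
        if cuisine.lower() in categories.lower():
            return cuisine
    return None

print(get_major_cuisine("Nightlife, Bars, Mexican, Restaurants"))  # → Mexican
print(get_major_cuisine("Italian, Chinese, Restaurants"))          # → Italian (list order, not category order)
print(get_major_cuisine("Latin American, Restaurants"))            # → American (substring over-match)
print(get_major_cuisine("Brewpubs, Breweries, Food"))              # → None
```

The second and third calls show the two caveats: when a business lists multiple target cuisines, the tie is broken by `cuisine_types` order, and a broader label containing a cuisine name as a substring is counted as that cuisine.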
Feature Scaling (Numeric)¶
merged_df.select_dtypes(include=['number']).columns
Index(['review_stars', 'useful', 'funny', 'cool', 'year', 'month', 'day',
'review_hr', 'review_length', 'latitude', 'longitude', 'business_stars',
'review_count', 'is_open', 'avg_opening_hrs'],
dtype='object')
- Scaled `review_count`, `business_stars`, `latitude`, `longitude`, `year`, `month`, `day`, `review_hr`, `review_length`, and `avg_opening_hrs` using Z-score standardization.
- Ensures features have a mean of 0 and a standard deviation of 1.
- Prepares the dataset for clustering and PCA by treating all features equally.
- We exclude `review_stars` from the PCA data matrix because it is the variable of interest (the y label), so it enters neither PCA nor clustering.
from sklearn.preprocessing import StandardScaler
# Select numerical columns to scale
#numerical_features = ['review_count', 'review_stars', 'business_stars']
numerical_features = [
'review_count','business_stars', 'latitude', 'longitude',
'year', 'month', 'day', 'review_hr', 'review_length',
'avg_opening_hrs']
# Standardization (Z-Score)
standard_scaler = StandardScaler()
# Cast to float first so scaled values don't trigger dtype warnings;
# using .loc avoids SettingWithCopyWarning
filtered_df[numerical_features] = filtered_df[numerical_features].astype(float)
filtered_df.loc[:, numerical_features] = \
    standard_scaler.fit_transform(filtered_df[numerical_features])
# Display the scaled dataset
print(filtered_df[numerical_features].head())
review_count business_stars latitude longitude year month \
3 -0.290677 -2.175331 0.776803 0.953303 -1.806387 1.335569
5 0.669448 1.224356 -1.592027 0.430804 1.480650 0.161579
7 -0.311656 -1.325409 0.737862 0.943432 -0.491572 -0.718914
8 -0.356084 -0.475487 -0.728070 -1.502917 -0.162868 -0.718914
9 -0.574518 -0.475487 0.833902 0.945169 -0.491572 1.335569
day review_hr review_length avg_opening_hrs
3 -0.885121 -1.426214 -0.175910 1.849057
5 -1.565276 -1.302427 -0.692089 0.613000
7 0.701906 -1.302427 1.465075 0.458492
8 0.135111 0.059223 -0.470595 -0.983575
9 -0.091607 -1.178641 0.848743 -1.426495
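A minimal check of what Z-score standardization does, on toy numbers rather than the project features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) standard deviation 1
print(np.round(X_scaled.mean(axis=0), 6))  # → [0. 0.]
print(np.round(X_scaled.std(axis=0), 6))   # → [1. 1.]
```

One practical detail: writing float results back into integer-typed columns (like `year` or `review_length`) is what triggers pandas dtype warnings, so casting those columns to float before assigning the scaled values is advisable.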
Feature Scaling (Categorical)¶
- Encoded `major_cuisine` into binary columns using `OneHotEncoder`.
- Used `sparse_output=False` to generate a dense array suitable for DataFrame conversion.
- Concatenated the encoded features with `filtered_df`.
- Dropped the original `major_cuisine` column after encoding.
- Prepares the categorical data for numerical analysis and machine learning models.
- We decided to exclude the `state` variable because it adds many column vectors, has trivial impact on clustering performance, and is less correlated with the other variables.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Select categorical columns to encode
# categorical_features = ['major_cuisine', 'state']
categorical_features = ['major_cuisine']
# Create OneHotEncoder instance
one_hot_encoder = OneHotEncoder(sparse_output=False) # Set sparse=False to get a dense array
# Fit and transform the categorical features
encoded_features = one_hot_encoder.fit_transform(filtered_df[categorical_features])
# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=one_hot_encoder.get_feature_names_out(categorical_features))
# Concatenate the encoded features with the original DataFrame
filtered_df = pd.concat([filtered_df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)
# Drop the original categorical column
filtered_df.drop(columns=categorical_features, inplace=True)
# Display the updated DataFrame
print(filtered_df.head())
review_id user_id business_id \
0 18E_haOfOm8ks-A7SlVWRg bnDZpsii_if2_wpn8oPcig bK0j7YtVyN98UnM_8fUONg
1 c7IQ5alG0pl9yCITtsIlrA ZLKpeCqbCMWfNeT6yU8wUQ zT2OzXDWKK1abapHs2RUrQ
2 YHIicUo2zqA5zwe-lXhsNw CEZMiWrgtF67m0GUm19ZJA nKpWUL3kMt4cnNQhye2WqA
3 3KFxmw4RG5E4ActnP8VPCQ IOJnU62iJL1LM_X6A_p1xw vtR2MjFToKkclbUX5DuhlQ
4 tsCWBn7pc09M3jKiwi4w-g 2ULSyP0EK7LQaavU89efLA eSJMA_VdUVQTDkRJiV9lHw
review_stars useful funny cool \
0 3.0 1 1 1
1 5.0 1 0 0
2 4.0 2 5 1
3 1.0 6 1 0
4 2.0 1 0 0
text date \
0 Dirt cheap happy hour specials. Half priced d... 2011-11-08 01:30:27
1 Philly cheese steak (loaded) was phenomenal. ... 2021-07-02 02:17:40
2 It's almost 10 o clock on a Tuesday and I am t... 2015-04-22 02:01:21
3 Great ambience and seated quickly after arrivi... 2016-04-17 13:31:53
4 Tried this restaurant for the first time tonig... 2015-11-15 03:17:19
year ... attributes \
0 -1.806387 ... {'NoiseLevel': "u'very_loud'", 'RestaurantsPri...
1 1.480650 ... {'RestaurantsReservations': 'False', 'Alcohol'...
2 -0.491572 ... {'BikeParking': 'True', 'GoodForMeal': "{'dess...
3 -0.162868 ... {'CoatCheck': 'False', 'NoiseLevel': "u'averag...
4 -0.491572 ... {'GoodForKids': 'True', 'Alcohol': "u'none'", ...
categories \
0 American (New), Bars, Sports Bars, Restaurants...
1 American (New), Restaurants
2 Food, Sandwiches, American (Traditional), Rest...
3 Nightlife, Bars, Mexican, Restaurants
4 Italian, Restaurants
hours avg_opening_hrs \
0 {'Monday': '11:0-2:0', 'Tuesday': '11:0-2:0', ... 1.849057
1 {'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'... 0.613000
2 {'Tuesday': '12:0-22:0', 'Wednesday': '12:0-22... 0.458492
3 {'Tuesday': '16:0-21:0', 'Wednesday': '16:0-21... -0.983575
4 {'Wednesday': '16:0-20:0', 'Thursday': '16:0-2... -1.426495
major_cuisine_American major_cuisine_Chinese major_cuisine_Indian \
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
major_cuisine_Italian major_cuisine_Japanese major_cuisine_Mexican
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
[5 rows x 34 columns]
Dimensionality Reduction¶
- Selected scaled numerical features processed earlier
- Included one-hot encoded categorical features: columns starting with `major_cuisine_`.
- Combined numerical and categorical features into a new DataFrame `features_df`.
- Prepares the data for dimensionality reduction techniques like PCA.
# Select one-hot encoded categorical features
categorical_features = [
col for col in filtered_df.columns if col.startswith('major_cuisine_')
]
# Combine scaled numerical features and one-hot encoded categorical features
selected_features = numerical_features + categorical_features
features_df = filtered_df[selected_features]
# Display the prepared DataFrame
print("Prepared Features for Dimensionality Reduction:")
print(features_df.head())
print(f"Shape of features_df: {features_df.shape}")
Prepared Features for Dimensionality Reduction:
review_count business_stars latitude longitude year month \
0 -0.290677 -2.175331 0.776803 0.953303 -1.806387 1.335569
1 0.669448 1.224356 -1.592027 0.430804 1.480650 0.161579
2 -0.311656 -1.325409 0.737862 0.943432 -0.491572 -0.718914
3 -0.356084 -0.475487 -0.728070 -1.502917 -0.162868 -0.718914
4 -0.574518 -0.475487 0.833902 0.945169 -0.491572 1.335569
day review_hr review_length avg_opening_hrs \
0 -0.885121 -1.426214 -0.175910 1.849057
1 -1.565276 -1.302427 -0.692089 0.613000
2 0.701906 -1.302427 1.465075 0.458492
3 0.135111 0.059223 -0.470595 -0.983575
4 -0.091607 -1.178641 0.848743 -1.426495
major_cuisine_American major_cuisine_Chinese major_cuisine_Indian \
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
major_cuisine_Italian major_cuisine_Japanese major_cuisine_Mexican
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
Shape of features_df: (28142, 16)
Applying PCA and Visualizing Explained Variance:¶
- Applied PCA to the prepared `features_df` to analyze the explained variance of each principal component.
- Calculated the individual explained variance ratio and the cumulative explained variance.
- Visualized the results:
- Bar Plot: Shows the individual explained variance for each component.
- Scatter Plot and Line: Represent the cumulative explained variance, helping determine the number of components needed to capture most of the variance.
- Helps identify the optimal number of principal components for dimensionality reduction.
from sklearn.decomposition import PCA
import pandas as pd
# Apply PCA to determine explained variance
pca = PCA() # Let PCA determine all components
pca_data = pca.fit_transform(features_df)
# Explained variance ratio and cumulative explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Plot the explained variance ratio and cumulative explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.7, label='Individual Explained Variance')
plt.scatter(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, label='Cumulative Explained Variance', color='blue')
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, linestyle='--', color='blue')
# Add labels and title
plt.xlabel('Principal Component Index')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.legend(loc='best')
plt.grid(True)
plt.show()
Choosing the Number of Principal Components:¶
- Based on the explained variance plot:
- The first 9 components explain approximately 88% of the variance in the data, and 10 components reach about 94%.
- Beyond 10 components, the cumulative variance increases minimally, indicating diminishing returns.
- Decision:
- Retain 10 components for dimensionality reduction to capture the majority of the information while reducing noise and complexity.
n_components = 10
pca = PCA(n_components=n_components)
# Fit and transform the data
reduced_features = pca.fit_transform(features_df)
# Create a DataFrame for the reduced features
reduced_df = pd.DataFrame(reduced_features,
columns=[f'PC{i+1}' for i in range(n_components)])
# Display the resulting DataFrame
print("Reduced Features with 10 Principal Components:")
print(reduced_df.head())
# Explained variance ratio for the 10 components
explained_variance = pca.explained_variance_ratio_
cumulative_variance = explained_variance.cumsum()
pca_variance_df = pd.DataFrame({
'Explained Variance Ratio': explained_variance,
'Cumulative Explained Variance': cumulative_variance,
}, index=[f'PC{i+1}' for i in range(n_components)])
# print("Explained Variance Ratio:", explained_variance)
# print("Cumulative Explained Variance:", cumulative_variance)
pca_variance_df
Reduced Features with 10 Principal Components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 \
0 -3.191052 0.447549 0.297477 1.102308 -1.413261 -1.246485 1.492744
1 1.646259 -1.726046 0.265134 -0.171910 -1.519887 -0.930614 1.229623
2 -1.766493 0.974424 -0.355674 0.038239 -0.577182 1.521661 0.985861
3 0.487037 -0.443548 -1.057817 0.151415 -0.337888 0.427243 -1.251094
4 -0.541648 1.648590 -0.809057 0.558286 0.635419 -0.586650 1.328383
PC8 PC9 PC10
0 -0.423566 -0.047467 -0.203261
1 0.994702 0.497637 0.545116
2 0.549622 0.510694 -0.492943
3 -0.336935 -1.142466 -0.437934
4 1.163941 -0.538012 -0.687192
| | Explained Variance Ratio | Cumulative Explained Variance |
|---|---|---|
| PC1 | 0.138477 | 0.138477 |
| PC2 | 0.111295 | 0.249772 |
| PC3 | 0.101976 | 0.351748 |
| PC4 | 0.096944 | 0.448692 |
| PC5 | 0.094133 | 0.542825 |
| PC6 | 0.091539 | 0.634364 |
| PC7 | 0.088010 | 0.722374 |
| PC8 | 0.082884 | 0.805258 |
| PC9 | 0.074648 | 0.879905 |
| PC10 | 0.060540 | 0.940445 |
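Rather than reading the component count off the table, scikit-learn's `PCA` also accepts a float `n_components` in (0, 1) and selects the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data (the 90% target and the 16-feature shape mirror this notebook; the random data is illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 16))  # 16 features, as in features_df

# Smallest number of components reaching 90% cumulative variance
pca = PCA(n_components=0.9)
pca.fit(X)

print(pca.n_components_)  # chosen component count
print(np.cumsum(pca.explained_variance_ratio_)[-1])  # cumulative variance >= 0.9
```

On this notebook's `features_df`, `PCA(n_components=0.9)` should select a count consistent with the table above (10 components, at ~94% cumulative variance).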
plt.figure()
top_components = pd.DataFrame(pca.components_.T,
index = features_df.columns)
#plot the principal components in terms of weights of raw variables
sns.heatmap(top_components, cmap='coolwarm', center=0)
plt.title('Top Components of PCA and associated Features')
plt.xlabel('ith PCA Component')
plt.ylabel('Raw Features')
Results with 10 Principal Components¶
Explained Variance Ratio
- Definition: The proportion of the total variance in the data explained by each principal component.
- PC1: Explains 13.85% of the variance.
- PC2: Explains 11.13% of the variance.
- PC3: Explains 10.20% of the variance.
- PC4: Explains 9.69% of the variance.
- The remaining explained variance ratios are listed in the first column of the table above.
Cumulative Explained Variance
- Definition: The total variance explained when combining multiple components.
- PC1 + PC2: Explain 24.98% of the variance.
- PC1 + PC2 + PC3: Explain 35.17% of the variance.
- PC1 + PC2 + PC3 + PC4: Explain 44.87% of the variance.
Interpretation of PCA Components
- Visualizing the PCA components with respect to the features in the dataframe, we observe that cuisine provides little discriminative information for capturing the variation in the data.
- We also observe complementary weighting across different principal components, which reflects the orthogonality of the PCA components.
- We interpret some of the salient components: PC1 (idx = 0) captures store-related information, PC2 (idx = 1) focuses on macro-review information, and PC3 (idx = 2) focuses on time information.
Significance:
- Retaining 10 principal components captures over 90% of the variance, meaning most of the information in the dataset is preserved.
- This reduction simplifies the dataset while minimizing information loss.
Application:
- The reduced features can now be used for clustering, classification, or other analyses, with a smaller, more manageable dataset.
2. Clustering and Analysis¶
We perform clustering because we are interested in how review ratings (`review_stars`) correlate with other variables. Exploring the hidden structure among these variables better informs restaurants on ways to improve their review ratings. Clustering serves this purpose well and provides insight into the habits of different customer groups.
K-Means Clustering¶
Intuition: We cluster the data using the informative components from the PCA preprocessing. While the PCA plot suggests that 10 components give the best trade-off between data compression and information preservation, this may not be beneficial for K-means, given the tendency toward overfitting and noise in the lower-variance PCA components. We therefore experiment with different numbers of PCA components and compare their silhouette scores.
def plot_kmeans_on_pca(X, range_n_clusters, vis = True):
'''
Plot the silhouette score and the KMeans clustering on PCA data
(From Lecture)
'''
for n_clusters in range_n_clusters:
# Initialize the clusterer with n_clusters value and a random generator
# seed of 42 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, n_init="auto", random_state=42)
cluster_labels = clusterer.fit_predict(X)
#features_df.loc[:, f'cluster_{n_clusters}'] = cluster_labels
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
if(vis == False): continue
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1 but in this example all
# lie within [-0.1, 1]
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
c=colors, edgecolor='k')
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
plt.show()
# Use the first 3 principal components
from sklearn.decomposition import PCA  # ensure PCA is available (not in the import cell above)
range_n_clusters = range(2, 10)
X = PCA(n_components=3).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.2407215136579905
For n_clusters = 3 The average silhouette_score is : 0.23698273688238533
For n_clusters = 4 The average silhouette_score is : 0.251357852627854
For n_clusters = 5 The average silhouette_score is : 0.22004460454651523
For n_clusters = 6 The average silhouette_score is : 0.2371698021942785
For n_clusters = 7 The average silhouette_score is : 0.22292517315368554
For n_clusters = 8 The average silhouette_score is : 0.22662615111651338
For n_clusters = 9 The average silhouette_score is : 0.22770173176484657
# Use the first 5 principal components
X = PCA(n_components=5).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.15592105387582791
For n_clusters = 3 The average silhouette_score is : 0.14966897726272577
For n_clusters = 4 The average silhouette_score is : 0.156425817630668
For n_clusters = 5 The average silhouette_score is : 0.16138163306363063
For n_clusters = 6 The average silhouette_score is : 0.15799503197248857
For n_clusters = 7 The average silhouette_score is : 0.15510327298427973
For n_clusters = 8 The average silhouette_score is : 0.15208334645648836
For n_clusters = 9 The average silhouette_score is : 0.15665460480553653
# Use the first 7 principal components
X = PCA(n_components=7).fit_transform(features_df)
plot_kmeans_on_pca(X, range_n_clusters)
For n_clusters = 2 The average silhouette_score is : 0.11639464234039845
For n_clusters = 3 The average silhouette_score is : 0.12007479814750582
For n_clusters = 4 The average silhouette_score is : 0.13275593451270487
For n_clusters = 5 The average silhouette_score is : 0.13367103763742733
For n_clusters = 6 The average silhouette_score is : 0.1202784571421074
For n_clusters = 7 The average silhouette_score is : 0.12186138230209836
For n_clusters = 8 The average silhouette_score is : 0.12310147157357862
For n_clusters = 9 The average silhouette_score is : 0.1215850616589286
Results with PCA + KMeans¶
- By observing the average silhouette scores, we see that increasing the number of PCA components considerably harms K-means performance; the optimal number of PCA components is 3. Even so, the uniformly low silhouette scores suggest that K-means may not be the best choice here.
- Applying K-means to the 3-component PCA features, the silhouette scores point to 4 as the optimal number of clusters, although k = 3 is also defensible since it separates the data more cleanly.
- We therefore conclude that the optimal number of K-means clusters is 4. Since a 2D visualization (2 PCA components) cannot capture variation that lies on a higher-dimensional manifold, the silhouette score better reflects how concentrated each cluster is internally and how well it is separated from the others.
Agglomerative Clustering¶
We now perform agglomerative clustering, which builds a hierarchical clustering by iteratively merging similar data points. This approach is more adaptive and relies on fewer assumptions than K-means.
fig, axes = plt.subplots(5, 2, figsize=(20, 40))
X = PCA(n_components=10).fit_transform(features_df)
for cluster_num in range(2, 12):
i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
#perform agglomerative clustering on the PCA data
cluster_labels = AgglomerativeClustering(
n_clusters=cluster_num, linkage='average').fit_predict(X)
#output the silhouette score to evaluate clustering quality
silhouette_avg = silhouette_score(X, cluster_labels)
axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
axes[i,j].set_title(f'{cluster_num} clusters')
print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.3859069175834477
For 3 clusters, the average silhouette_score is: 0.3577723981351766
For 4 clusters, the average silhouette_score is: 0.29328914391836314
For 5 clusters, the average silhouette_score is: 0.29140454937828475
For 6 clusters, the average silhouette_score is: 0.24558293640073994
For 7 clusters, the average silhouette_score is: 0.2452175148625528
For 8 clusters, the average silhouette_score is: 0.21045756819178682
For 9 clusters, the average silhouette_score is: 0.21021208815497672
For 10 clusters, the average silhouette_score is: 0.17581173152773555
For 11 clusters, the average silhouette_score is: 0.17247068027993778
# Same analysis as above but with 5 principal components instead of 10
fig, axes = plt.subplots(5, 2, figsize=(20, 40))
X = PCA(n_components=5).fit_transform(features_df)
X = PCA(n_components=5).fit_transform(features_df)
for cluster_num in range(2, 12):
i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
cluster_labels = AgglomerativeClustering(
n_clusters=cluster_num, linkage='average').fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
axes[i,j].set_title(f'{cluster_num} clusters')
print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.4380196805494214
For 3 clusters, the average silhouette_score is: 0.2953958156906736
For 4 clusters, the average silhouette_score is: 0.26348655335428267
For 5 clusters, the average silhouette_score is: 0.17919400592690682
For 6 clusters, the average silhouette_score is: 0.13344845991725302
For 7 clusters, the average silhouette_score is: 0.13289440471072397
For 8 clusters, the average silhouette_score is: 0.10324328871075238
For 9 clusters, the average silhouette_score is: 0.05963217638982547
For 10 clusters, the average silhouette_score is: 0.05914563262152535
For 11 clusters, the average silhouette_score is: 0.051686644796053886
fig, axes = plt.subplots(4, 2, figsize=(20, 40))
agg_df = features_df.copy()
agg_df['review_stars'] = filtered_df['review_stars']
X = PCA(n_components=3).fit_transform(features_df)
for cluster_num in range(2, 10):
i,j = (cluster_num - 2) // 2, (cluster_num - 2) % 2
cluster_labels = AgglomerativeClustering(
n_clusters=cluster_num, linkage='average').fit_predict(X)
agg_df[f'cluster_{cluster_num}'] = cluster_labels
silhouette_avg = silhouette_score(X, cluster_labels)
axes[i,j].scatter(X[:,0], X[:,1], c=cluster_labels)
axes[i,j].set_title(f'{cluster_num} clusters')
print(f'For {cluster_num} clusters, the average silhouette_score is: {silhouette_avg}')
For 2 clusters, the average silhouette_score is: 0.5176685576133001
For 3 clusters, the average silhouette_score is: 0.5083609340146049
For 4 clusters, the average silhouette_score is: 0.38915342785638585
For 5 clusters, the average silhouette_score is: 0.24581118082182937
For 6 clusters, the average silhouette_score is: 0.16728503970849987
For 7 clusters, the average silhouette_score is: 0.16480647142682803
For 8 clusters, the average silhouette_score is: 0.1206762754372233
For 9 clusters, the average silhouette_score is: 0.11960952215844287
Discussions to Agglomerative Clustering¶
Agglomerative Clustering Results
- Evaluating the silhouette scores, we observe that using 3 PCA components yields the best results; within that setting, 2 or 3 clusters perform best.
- We also observe that increasing the number of clusters sharply degrades clustering quality, consistent with a tendency to overfit.
Agglomerative Clustering vs. KMeans
- Compared with K-means, agglomerative clustering is significantly more robust to the choice of PCA data matrix, with less fluctuation in performance.
- Agglomerative clustering is also better justified on the Yelp data because reviews are usually distributed non-uniformly across clusters on the data manifold. It forms clusters from pairwise similarity between data points and the internal nodes it constructs. K-means, in contrast, assumes spherical clusters of roughly equal size and similar variability, so it is sensitive to noise or additional features (e.g. more PCA components) that violate these assumptions, as well as to the initialization of centroids.
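The merge hierarchy that agglomerative clustering builds can be inspected directly with the `dendrogram` utility imported at the top of the notebook; a minimal sketch on synthetic data (the real pipeline would pass the PCA-reduced matrix instead):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
# Stand-in for the PCA-reduced matrix: two loose groups of 20 points each.
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(6.0, 1.0, (20, 3))])

# 'average' linkage matches the linkage used by AgglomerativeClustering above.
Z = linkage(X, method="average")

# Each of the n-1 rows of Z records one merge: (left, right, distance, new size).
print(Z.shape)  # (39, 4)

# dendrogram(Z) would draw the merge tree; no_plot=True just returns its layout.
tree = dendrogram(Z, no_plot=True)
print(len(tree["ivl"]))  # 40 leaves, one per sample
```

Cutting the resulting tree at a chosen distance reproduces the flat cluster labels, which makes the 2-vs-3 cluster choice above easy to inspect visually.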
#Evaluate clusters against review_stars(cluster = 2)
per_cluster_rating = agg_df.groupby('cluster_2')['review_stars'].mean()
per_cluster_rating.plot(kind='bar')
plt.title('Average Review Stars per Cluster')
plt.ylabel('Average Review Stars')
plt.xlabel('Cluster Number')
Text(0.5, 0, 'Cluster Number')
features = [f for f in agg_df.columns if 'cluster' not in f]
# Analyze the mean of each feature per cluster to see what each cluster represents
cluster_df = agg_df.groupby('cluster_2')[features].mean()
feature_std = agg_df[features].std()
top_5_features = feature_std.nlargest(5)
cluster_df.loc[:, top_5_features.index]
| cluster_2 | review_stars | review_count | business_stars | longitude | year |
|---|---|---|---|---|---|
| 0 | 3.786011 | -0.003737 | -0.000827 | 0.000103 | 0.000379 |
| 1 | 3.105263 | 5.530932 | 1.224356 | -0.153185 | -0.560773 |
#Evaluate clusters against review_stars(cluster = 3)
per_cluster_rating = agg_df.groupby('cluster_3')['review_stars'].mean()
per_cluster_rating.plot(kind='bar')
plt.title('Average Review Stars per Cluster')
plt.ylabel('Average Review Stars')
plt.xlabel('Cluster Number')
Text(0.5, 0, 'Cluster Number')
features = [f for f in agg_df.columns if 'cluster' not in f]
# Analyze the mean of each feature per cluster to see what each cluster represents
cluster_df = agg_df.groupby('cluster_3')[features].mean()
# Select features with the highest variance (most distinguishing)
feature_std = agg_df[features].std()
top_5_features = feature_std.nlargest(5)
cluster_df.loc[:, top_5_features.index]
| cluster_3 | review_stars | review_count | business_stars | longitude | year |
|---|---|---|---|---|---|
| 0 | 2.375000 | 0.017162 | -0.173367 | -0.044754 | -1.137423 |
| 1 | 3.105263 | 5.530932 | 1.224356 | -0.153185 | -0.560773 |
| 2 | 3.798974 | -0.003929 | 0.000758 | 0.000516 | 0.010831 |
Discussion of Clustering Results With Respect to Features¶
- The PCA matrix suggests that a restaurant's cuisine has limited influence on the other review- and restaurant-related features.
- Both clustering results suggest that the restaurant reviews fall into 2-3 groups; using more clusters quickly overfits and introduces noise.
- We interpret each review cluster by examining its mean feature values. The clusters are discriminative across review_count, business_stars, longitude, and year, and they correlate with the target of interest, review_stars. This suggests the clustering is effective and corresponds to rating segments. The clustered data frame also shows that year is positively correlated with review_stars.
- By performing clustering, we have analyzed review trends that could give restaurants insight into customer inclinations and ways to improve their ratings.
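The claimed positive year-review_stars relationship can be checked directly on the clustered frame; a minimal sketch, with a hypothetical miniature `agg_df` standing in for the real one:

```python
import pandas as pd

# Hypothetical miniature stand-in for agg_df with the relevant columns.
agg_df = pd.DataFrame({
    "year":         [2014, 2015, 2016, 2017, 2018, 2019],
    "review_stars": [2.5, 3.0, 3.2, 3.6, 4.0, 4.3],
    "cluster_3":    [0, 0, 1, 1, 2, 2],
})

# Pearson correlation between year and review_stars across all reviews...
print(round(agg_df["year"].corr(agg_df["review_stars"]), 3))

# ...and the per-cluster mean table, analogous to the groupby cells above.
print(agg_df.groupby("cluster_3")[["year", "review_stars"]].mean())
```

On the real frame, a correlation near zero would weaken the per-cluster interpretation, so this is a cheap sanity check before drawing conclusions from the cluster means.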